Transformations and Constrasts in Multiple Regression

Author

Reinhold Kliegl

1 Setup

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(car)
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some
Code
library(performance)
library(emmeans)
Welcome to emmeans.
Caution: You lose important information if you filter this package's results.
See '? untidy'
Code
library(hypr)

2 Transformation after checking model residuals

2.1 Analyses of six-minute run: No need for transformation

2.1.1 Graph

Why 42? The Answer to the Ultimate Question of Life, the Universe, and Everything is 42

Code
set.seed(42)
dat_run <- 
  readRDS("data/EmotikonSubset.rds") |>
  mutate(Sex = fct_recode(Sex, Girls="female", Boys="male"),
         AgeBin = cut(age, breaks = c(7.8, seq(8.2, 9.0, 0.25), 9.3)), 
         cohort = as.numeric(as.character(Cohort))) |> 
  group_by(Test) |> 
  filter(Test=="Run") |>
  group_by(Cohort, Sex) |> 
  sample_n(50) |> 
  ungroup() |> 
  select(Cohort, School, Child, Sex, AgeBin, cohort, age, score)

dat_run |>
  group_by(Cohort, Sex) |> 
  summarise(N=n(), Run_M=mean(score), Run_SD=sd(score))
`summarise()` has grouped output by 'Cohort'. You can override using the
`.groups` argument.
# A tibble: 18 × 5
# Groups:   Cohort [9]
   Cohort Sex       N Run_M Run_SD
   <fct>  <fct> <int> <dbl>  <dbl>
 1 2011   Girls    50  993.   114.
 2 2011   Boys     50 1096.   153.
 3 2012   Girls    50  976.   127.
 4 2012   Boys     50 1057.   151.
 5 2013   Girls    50  963.   146.
 6 2013   Boys     50 1034.   153.
 7 2014   Girls    50  990.   147.
 8 2014   Boys     50 1030.   142.
 9 2015   Girls    50 1003.   105.
10 2015   Boys     50 1036.   135.
11 2016   Girls    50  992.   148.
12 2016   Boys     50 1067.   143.
13 2017   Girls    50  950.   138.
14 2017   Boys     50 1022.   161.
15 2018   Girls    50  969.   130.
16 2018   Boys     50 1013.   136.
17 2019   Girls    50  947.   141.
18 2019   Boys     50 1009.   179.
Code
dat_run |> 
  ggplot(aes(x=age, y=score, group=Sex, color=Sex)) +
  geom_jitter() +
  geom_smooth(method="lm") +
#  geom_smooth(linetype="dashed") +
  scale_colour_manual("", values=c("red", "blue")) +
  scale_x_continuous("Age [year]", breaks=seq(7.75,9.25,.25)) +
  scale_y_continuous("Six-minute run [m]", breaks=seq(500,1500,100)) +
  geom_hline(yintercept=1008.11) +
  coord_cartesian(xlim=c(8, 9.1), ylim=c(500,1500)) +
  theme_bw() + theme(legend.position = "top") 
`geom_smooth()` using formula = 'y ~ x'

Code
dat_run |> 
  ggplot(aes(x=cohort, y=score, group=Sex, color=Sex)) +
  geom_jitter() +
  geom_smooth(method="lm") +
#  geom_smooth(linetype="dashed") +
  scale_colour_manual("", values=c("red", "blue")) +
  scale_x_continuous("Cohort", breaks=seq(2011,2019)) +
  scale_y_continuous("Six-minute run [m]", breaks=seq(500,1500,100)) +
  geom_hline(yintercept=1008.11) +
  coord_cartesian(xlim=c(2011, 2019), ylim=c(500,1500)) +
  theme_bw() + theme(legend.position = "top") 
`geom_smooth()` using formula = 'y ~ x'

2.1.2 Multiple regression

Code
m1_run <-  lm(score ~ 1 + Sex + age + cohort, data=dat_run)
summary(m1_run)

Call:
lm(formula = score ~ 1 + Sex + age + cohort, data = dat_run)

Residuals:
    Min      1Q  Median      3Q     Max 
-445.48  -85.95    3.54   95.50  438.28 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13859.702   3742.467   3.703 0.000226 ***
SexBoys        64.205      9.484   6.770 2.32e-11 ***
age            33.270     16.626   2.001 0.045680 *  
cohort         -6.535      1.869  -3.496 0.000495 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 142.2 on 896 degrees of freedom
Multiple R-squared:  0.06351,   Adjusted R-squared:  0.06038 
F-statistic: 20.26 on 3 and 896 DF,  p-value: 1.046e-12
Code
check_model(m1_run)

Code
boxCox(m1_run)

2.2 Analysis of Star run: Need for reciprocal transformation

Code
set.seed(42)
dat_star <- 
  readRDS("data/EmotikonSubset.rds") |>
  mutate(Sex = fct_recode(Sex, Girls="female", Boys="male"),
         AgeBin = cut( age, breaks = c(7.8, seq(8.2, 9.0, 0.25), 9.3)), 
         cohort = as.numeric(as.character(Cohort))) |> 
  group_by(Test) |> 
  filter(Test=="Star_r") |>
  mutate(score = 51/score) |>  
  group_by(Cohort, Sex) |> 
  sample_n(50) |> 
  ungroup() |> 
  select(Cohort, School, Child, Sex, AgeBin, cohort, age, score)

m1_star <-  lm(score ~ 1 + age + Sex + cohort, data=dat_star)
summary(m1_star)

Call:
lm(formula = score ~ 1 + age + Sex + cohort, data = dat_star)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0146 -2.5613 -0.5408  1.9885 18.4222 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -199.25525  100.81191  -1.977  0.04840 * 
age           -1.33898    0.44055  -3.039  0.00244 **
SexBoys       -0.49168    0.25611  -1.920  0.05520 . 
cohort         0.11738    0.05032   2.333  0.01987 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.84 on 896 degrees of freedom
Multiple R-squared:  0.01821,   Adjusted R-squared:  0.01492 
F-statistic:  5.54 on 3 and 896 DF,  p-value: 0.0009052
Code
check_model(m1_star)

Code
boxCox(m1_star)

Tukey’s ladder of power transformations for lambdas: + lambda = -1, then 1/score
+ lambda = -1/2, then 1/sqrt(score) + [lambda = 0, then log(score)] – special case because sccore^0 = 1 + lambda = +1/2, then sqrt(score) + lambda = 1, then no transformation

Code
dat_star$score_r = 51/dat_star$score

m1_star_r <-  lm(score_r ~ 1 + age + Sex + cohort, data=dat_star)
summary(m1_star_r)

Call:
lm(formula = score_r ~ 1 + age + Sex + cohort, data = dat_star)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.91937 -0.18806  0.00168  0.18231  0.92385 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 19.102636   7.521905   2.540  0.01127 * 
age          0.107257   0.032871   3.263  0.00114 **
SexBoys      0.043998   0.019109   2.302  0.02154 * 
cohort      -0.008935   0.003754  -2.380  0.01752 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2865 on 896 degrees of freedom
Multiple R-squared:  0.02158,   Adjusted R-squared:  0.01831 
F-statistic: 6.589 on 3 and 896 DF,  p-value: 0.0002092
Code
check_model(m1_star_r)

Code
boxCox(m1_star_r)

3 Factors and covariates

The following analysis assume a balanced design, that is equal number of children in each of the Sex x Cohort cells of the design.

Contrasts are needed for factors; some form of centering is often needed for covariates.

3.1 Factor Sex

3.1.1 Treatment coding

If you don’t specify a contrast, R will specify a treatment contrast for you. This is the default setting.

contrasts(dat_run$Sex) <- contr.treatment(2)
Code
contrasts(dat_run$Sex)
      Boys
Girls    0
Boys     1
Code
m2_run_trt <-  lm(score ~ 1 +  Sex, data=dat_run)
summary(m2_run_trt)

Call:
lm(formula = score ~ 1 + Sex, data = dat_run)

Residuals:
    Min      1Q  Median      3Q     Max 
-448.49  -90.74    6.39   95.28  459.51 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  975.722      6.748 144.597  < 2e-16 ***
SexBoys       64.767      9.543   6.787 2.08e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 143.1 on 898 degrees of freedom
Multiple R-squared:  0.04879,   Adjusted R-squared:  0.04773 
F-statistic: 46.06 on 1 and 898 DF,  p-value: 2.08e-11
Code
(tbl_sex <- dat_run |> group_by(Sex) |> summarise(N=n(), M=mean(score), SD=sd(score)))
# A tibble: 2 × 4
  Sex       N     M    SD
  <fct> <int> <dbl> <dbl>
1 Girls   450  976.  134.
2 Boys    450 1040.  152.
Code
(Boys_Girls <- tbl_sex$M[2] - tbl_sex$M[1])   # Boys - girls
[1] 64.76667
Code
(GrandMean <- mean(dat_run$score))
[1] 1008.106
  • (Intercept) is mean of group with level 0: Girls.
  • Sex2 is difference of Boys’ and Girls’ means.

Note: When the data are balanced, you don’t need multiple regression (MR) to estimate effects; you need MR to estimate SEs of effects, that is to test the significance of the difference.

3.1.2 Effect coding

Code
contrasts(dat_run$Sex) <- contr.sum(2)

m2_run_sum <-  lm(score ~ 1 +  Sex, data=dat_run)
summary(m2_run_sum)

Call:
lm(formula = score ~ 1 + Sex, data = dat_run)

Residuals:
    Min      1Q  Median      3Q     Max 
-448.49  -90.74    6.39   95.28  459.51 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1008.106      4.771 211.277  < 2e-16 ***
Sex1         -32.383      4.771  -6.787 2.08e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 143.1 on 898 degrees of freedom
Multiple R-squared:  0.04879,   Adjusted R-squared:  0.04773 
F-statistic: 46.06 on 1 and 898 DF,  p-value: 2.08e-11
Code
tbl_sex 
# A tibble: 2 × 4
  Sex       N     M    SD
  <fct> <int> <dbl> <dbl>
1 Girls   450  976.  134.
2 Boys    450 1040.  152.
Code
GrandMean
[1] 1008.106
Code
(DistFromGM <- (tbl_sex$M[2] - tbl_sex$M[1])/2)   # Boys
[1] 32.38333
  • (Intercept) is Grand Mean
  • Sex1 is distance of Boys from Grand Mean (same absolute difference for Girls).

3.1.3 Sequential difference coding (“sliding contrast”)

Code
contrasts(dat_run$Sex) <- MASS::contr.sdif(2)  # contr.sdif() "lives" in package MASS

m2_run_sqd <-  lm(score ~ 1 + Sex, data=dat_run)
summary(m2_run_sqd)

Call:
lm(formula = score ~ 1 + Sex, data = dat_run)

Residuals:
    Min      1Q  Median      3Q     Max 
-448.49  -90.74    6.39   95.28  459.51 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1008.106      4.771 211.277  < 2e-16 ***
Sex2-1        64.767      9.543   6.787 2.08e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 143.1 on 898 degrees of freedom
Multiple R-squared:  0.04879,   Adjusted R-squared:  0.04773 
F-statistic: 46.06 on 1 and 898 DF,  p-value: 2.08e-11
Code
tbl_sex 
# A tibble: 2 × 4
  Sex       N     M    SD
  <fct> <int> <dbl> <dbl>
1 Girls   450  976.  134.
2 Boys    450 1040.  152.
Code
GrandMean
[1] 1008.106
Code
(Boys_Girls <- tbl_sex$M[2] - tbl_sex$M[1])   # Boys - girls
[1] 64.76667
  • (Intercept) is Grand Mean
  • Sex2-1 is difference of Boys’ and Girls’ means.

3.2 Covariate cohort

Now we model the effect of the covariate cohort.

Code
contrasts(dat_run$Sex) <- MASS::contr.sdif(2)
m3_run <-  lm(score ~ 1 + cohort, data=dat_run)
summary(m3_run)

Call:
lm(formula = score ~ 1 + cohort, data = dat_run)

Residuals:
    Min      1Q  Median      3Q     Max 
-437.76  -95.42    3.07   98.38  468.58 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12752.533   3797.832   3.358 0.000819 ***
cohort         -5.828      1.885  -3.092 0.002047 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 146 on 898 degrees of freedom
Multiple R-squared:  0.01054,   Adjusted R-squared:  0.009435 
F-statistic: 9.563 on 1 and 898 DF,  p-value: 0.002047

The (Intercept) estimate looks very different. It is the Grand Mean when cohort has the value 0, that is around the first Christmas in history. This is not very plausible, or course.

Code
dat_run |> 
  ggplot(aes(x=cohort, y=score)) + 
  geom_smooth(method=lm, fullrange=TRUE) + 
  scale_x_continuous("Cohort [year]", limits=c(0,2020), breaks=seq(0,2000, 250)) +
  scale_y_continuous("Six-minute run [m]", breaks=seq(0,20000,1000)) +
  theme_bw()
`geom_smooth()` using formula = 'y ~ x'

That’s why it is a good idea to center covariates at a meaningful value.

Code
contrasts(dat_run$Sex) <- MASS::contr.sdif(2)
dat_run$cohort_c <- dat_run$cohort - 2015

m3b_run <-  lm(score ~ 1 + cohort_c, data=dat_run)
summary(m3b_run)

Call:
lm(formula = score ~ 1 + cohort_c, data = dat_run)

Residuals:
    Min      1Q  Median      3Q     Max 
-437.76  -95.42    3.07   98.38  468.58 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1008.106      4.866 207.153  < 2e-16 ***
cohort_c      -5.829      1.885  -3.092  0.00205 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 146 on 898 degrees of freedom
Multiple R-squared:  0.01054,   Adjusted R-squared:  0.009435 
F-statistic: 9.563 on 1 and 898 DF,  p-value: 0.002047
Code
# rescaling does not change the goodness of fit
anova(m3_run, m3b_run)
Analysis of Variance Table

Model 1: score ~ 1 + cohort
Model 2: score ~ 1 + cohort_c
  Res.Df      RSS Df  Sum of Sq F Pr(>F)
1    898 19140278                       
2    898 19140278  0 1.5646e-07         
  • (Intercept): GM in year 2015 is equal to the GM of all the data because I have balanced set of data and because 2015 is in the middle of cohort.
  • cohort_c: change in dependent variable when covariate is increase by 1 unit (year in this example)

3.3 Factor Cohort

3.3.1 Treatment coding

Code
contrasts(dat_run$Cohort) <- contr.treatment(9, base=1)

m4_run_trt <-  lm(score ~ 1 + Cohort, data=dat_run)
summary(m4_run_trt)

Call:
lm(formula = score ~ 1 + Cohort, data = dat_run)

Residuals:
    Min      1Q  Median      3Q     Max 
-424.48  -98.67    1.09   99.55  455.62 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1044.38      14.59  71.583  < 2e-16 ***
Cohort2       -27.71      20.63  -1.343  0.17962    
Cohort3       -45.93      20.63  -2.226  0.02626 *  
Cohort4       -34.39      20.63  -1.667  0.09592 .  
Cohort5       -25.00      20.63  -1.212  0.22597    
Cohort6       -14.90      20.63  -0.722  0.47040    
Cohort7       -58.18      20.63  -2.820  0.00491 ** 
Cohort8       -53.61      20.63  -2.598  0.00952 ** 
Cohort9       -66.75      20.63  -3.235  0.00126 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 145.9 on 891 degrees of freedom
Multiple R-squared:  0.01954,   Adjusted R-squared:  0.01073 
F-statistic: 2.219 on 8 and 891 DF,  p-value: 0.02411
Code
(tbl_chrt <- dat_run |> group_by(Cohort) |> summarise(N=n(), M=mean(score), SD=sd(score)))
# A tibble: 9 × 4
  Cohort     N     M    SD
  <fct>  <int> <dbl> <dbl>
1 2011     100 1044.  144.
2 2012     100 1017.  145.
3 2013     100  998.  153.
4 2014     100 1010.  145.
5 2015     100 1019.  122.
6 2016     100 1029.  150.
7 2017     100  986.  154.
8 2018     100  991.  134.
9 2019     100  978.  163.
  • (Intercept) is mean at first level (i.e., 2011 in the example)
  • Cohort2 is 2012 - 2011
  • Cohort3 is 2013 - 2011

With the base argument you can set a different level as the reference level.

3.3.2 Sequential difference coding

Code
contrasts(dat_run$Cohort) <- MASS::contr.sdif(9)

m4_run_sqd <-  lm(score ~ 1 + Cohort, data=dat_run)
summary(m4_run_sqd)

Call:
lm(formula = score ~ 1 + Cohort, data = dat_run)

Residuals:
    Min      1Q  Median      3Q     Max 
-424.48  -98.67    1.08   99.55  455.62 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1008.106      4.863 207.289   <2e-16 ***
Cohort2-1    -27.710     20.633  -1.343   0.1796    
Cohort3-2    -18.220     20.633  -0.883   0.3775    
Cohort4-3     11.540     20.633   0.559   0.5761    
Cohort5-4      9.390     20.633   0.455   0.6492    
Cohort6-5     10.100     20.633   0.490   0.6246    
Cohort7-6    -43.280     20.633  -2.098   0.0362 *  
Cohort8-7      4.570     20.633   0.221   0.8248    
Cohort9-8    -13.140     20.633  -0.637   0.5244    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 145.9 on 891 degrees of freedom
Multiple R-squared:  0.01954,   Adjusted R-squared:  0.01073 
F-statistic: 2.219 on 8 and 891 DF,  p-value: 0.02411
Code
GrandMean
[1] 1008.106
Code
tbl_chrt
# A tibble: 9 × 4
  Cohort     N     M    SD
  <fct>  <int> <dbl> <dbl>
1 2011     100 1044.  144.
2 2012     100 1017.  145.
3 2013     100  998.  153.
4 2014     100 1010.  145.
5 2015     100 1019.  122.
6 2016     100 1029.  150.
7 2017     100  986.  154.
8 2018     100  991.  134.
9 2019     100  978.  163.
  • (Intercept) is Grand Mean
  • Cohort2 is 2012 - 2011
  • Cohort3 is 2013 - 2012
  • Cohort4 is 2014 - 2013

It is interesting to compare m3_run and m4_run_sqd.

Code
anova(m3_run, m4_run_sqd)
Analysis of Variance Table

Model 1: score ~ 1 + cohort
Model 2: score ~ 1 + Cohort
  Res.Df      RSS Df Sum of Sq      F Pr(>F)
1    898 19140278                           
2    891 18966199  7    174080 1.1683 0.3184

Thus, it was a bit of waste to spend eight degrees of freedom for Cohort. We can get the same goodness of fit with 1 df for cohort.

3.3.3 Hypothesis coding

The most common contrasts [contr.sum(), MASS::contr.sdif(), contr.treatment()] are available as R functions. If you want to set up your own contrast, not covered by the pre-programmed ones, see:

Both papers are also in the DSSS/literature folder.

4 Appendix

Code
sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] hypr_0.2.8         emmeans_1.10.7     performance_0.14.0 car_3.1-3         
 [5] carData_3.0-5      lubridate_1.9.4    forcats_1.0.0      stringr_1.5.1     
 [9] dplyr_1.1.4        purrr_1.0.4        readr_2.1.5        tidyr_1.3.1       
[13] tibble_3.2.1       ggplot2_3.5.2      tidyverse_2.0.0   

loaded via a namespace (and not attached):
 [1] utf8_1.2.5         generics_0.1.3     stringi_1.8.7      lattice_0.22-7    
 [5] pracma_2.4.4       hms_1.1.3          digest_0.6.37      magrittr_2.0.3    
 [9] evaluate_1.0.3     grid_4.5.0         timechange_0.3.0   estimability_1.5.1
[13] RColorBrewer_1.1-3 mvtnorm_1.3-3      fastmap_1.2.0      Matrix_1.7-3      
[17] jsonlite_2.0.0     ggrepel_0.9.6      Formula_1.2-5      mgcv_1.9-3        
[21] scales_1.4.0       abind_1.4-8        cli_3.6.5          rlang_1.1.6       
[25] splines_4.5.0      withr_3.0.2        yaml_2.3.10        datawizard_1.1.0  
[29] tools_4.5.0        tzdb_0.5.0         see_0.11.0         coda_0.19-4.1     
[33] bayestestR_0.16.0  vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.4   
[37] htmlwidgets_1.6.4  MASS_7.3-65        insight_1.3.0      pkgconfig_2.0.3   
[41] pillar_1.10.2      gtable_0.3.6       Rcpp_1.0.14        glue_1.8.0        
[45] xfun_0.52          tidyselect_1.2.1   rstudioapi_0.17.1  knitr_1.50        
[49] farver_2.1.2       xtable_1.8-4       patchwork_1.3.0    nlme_3.1-168      
[53] htmltools_0.5.8.1  labeling_0.4.3     rmarkdown_2.29     compiler_4.5.0